Posted 2026-07-01 | Updated 2026-07-03 | Artificial Intelligence | 36 minutes read (About 5357 words)

VeRL Async Policy

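Introduction

The core question with VeRL async is not whether "turning on async is always faster", but how to tune rollout long tails, training updates, parameter synchronization, and tolerance for stale samples inside a single queueing system. This note sorts out how VeRL's legacy one_step_off_policy / fully_async_policy relate to the new trainer v1, explains what staleness actually means, and gives a first-pass method for choosing the train/rollout resource split on 64P and 128P NPU clusters.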

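Key takeaways

Start from trainer v1 for the new entry point: the current VeRL main_ppo.py enters TaskRunnerV1 via trainer.use_v1=True and selects sync, colocate_async, or separate_async via trainer.v1.trainer_mode. ppo_trainer.yaml defaults to use_v1: true and trainer_mode: sync.
The legacy async paths are still useful references: one_step_off_policy explains one-step-off overlap; fully_async_policy exposes staleness_threshold, trigger_parameter_sync_step, partial_rollout, and separate Trainer/Rollouter resource configuration more completely.
staleness is not simply "generate some extra percentage once inference finishes": in legacy fully async it is the maximum allowed fraction of samples generated with old parameters; the V1 replay buffer additionally records how many model versions a trajectory spans and how far it lags behind the current training step.
For 64P/128P, derive the resource split from the sync timing breakdown first: if the sync profile shows rollout:train ≈ 3:1, start the first round at train:rollout ≈ 1:3, i.e. 16:48 on 64P and 32:96 on 128P, then adjust using trainer/idle_ratio, rollouter/idle_ratio, and measured step time.
Keep the ideal model and measurements separate: in the ideal model with 128P at 32:96, linear scaling, and full overlap, step time can reach 0.25 of the 32P sync baseline; but if the sync baseline itself has no length long tail, padding waste, or extra idling, the theoretical upper bound on async per-card throughput is only 1.00x, and it cannot exceed a fully loaded sync baseline out of thin air.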

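Old and new implementations

VeRL async currently has at least three layers of implementation that are easy to conflate.

one-step overlap

The legacy one_step_off_policy entry point is: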

python3 -m verl.experimental.one_step_off_policy.async_main_ppo \
  --config-path=config \
  --config-name='one_step_off_ppo_trainer.yaml' \
  actor_rollout_ref.hybrid_engine=False \
  trainer.nnodes=1 \
  trainer.n_gpus_per_node=6 \
  rollout.nnodes=1 \
  rollout.n_gpus_per_node=2

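Its mechanism is to generate the next batch asynchronously while training on the current batch. The approximate breakdown given in the official documentation is:

colocate sync: step ≈ gen + old_log_prob + update_actor
one-step-overlap async: step ≈ wait_prev_gen + old_log_prob + update_actor

So what it mainly removes is the serial wait between rollout and training, but it remains a fixed "one step off" policy with limited flexibility. In the official example with 32 H20 GPUs and Qwen2.5-Math-7B, FSDP2 drops from 19h18m to 15h34m and Megatron from 18h21m to 13h06m.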
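
The two step-time breakdowns above can be compared with a few lines of Python. This is a minimal sketch, not VeRL code: the function names are made up for illustration, the timings are hypothetical, and the way `wait_prev_gen` is derived is an assumption (generation of the next batch is taken to start when training on the current batch starts).

```python
def sync_step_time(gen, old_log_prob, update_actor):
    """Colocated sync: generation and training run back to back."""
    return gen + old_log_prob + update_actor


def one_step_overlap_step_time(gen, old_log_prob, update_actor):
    """One-step-off: the next batch is generated while the current one trains.

    Assumption (not from the VeRL docs): generation of the next batch starts
    when training on the current batch starts, so the trainer only waits for
    the part of `gen` that is not hidden behind the training phase.
    """
    train_phase = old_log_prob + update_actor
    wait_prev_gen = max(0.0, gen - train_phase)
    return wait_prev_gen + train_phase


# Hypothetical per-step timings, chosen only for illustration.
timings = dict(gen=60.0, old_log_prob=10.0, update_actor=40.0)
print(sync_step_time(**timings))              # 110.0
print(one_step_overlap_step_time(**timings))  # 60.0
```

Under this assumption the overlapped step time reduces to max(gen, old_log_prob + update_actor), which is why the gain shrinks once one side dominates.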

fully async policy

The fully_async_policy entry point still lives under experimental:

python -m verl.experimental.fully_async_policy.fully_async_main \
  actor_rollout_ref.hybrid_engine=False \
  actor_rollout_ref.rollout.mode=async \
  trainer.nnodes="${NNODES_TRAIN}" \
  trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
  rollout.nnodes="${NNODES_ROLLOUT}" \
  rollout.n_gpus_per_node="${NGPUS_PER_NODE}" \
  async_training.staleness_threshold="${staleness_threshold}" \
  async_training.trigger_parameter_sync_step="${trigger_parameter_sync_step}" \
  async_training.partial_rollout="${partial_rollout}"

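It splits the system into four parts: Rollouter, MessageQueue, Trainer, and ParameterSynchronizer. The Rollouter generates in a per-sample streaming fashion, the Trainer trains once it has pulled require_batches * ppo_mini_batch_size samples from the queue, and parameter synchronization is triggered after a number of training steps. The design document also states explicitly that the gain comes from overlapping rollout and train time once training and inference are isolated.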
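
The producer/consumer shape described above can be mimicked with a toy asyncio pipeline. This is only a sketch under stated assumptions: `rollouter`, `trainer`, and `run_toy_pipeline` are hypothetical stand-ins, the "training step" and "parameter sync" are just counters, and nothing here calls VeRL.

```python
import asyncio


async def rollouter(queue, num_samples, state):
    """Stream samples one at a time, tagging each with the current version."""
    for sample_id in range(num_samples):
        await asyncio.sleep(0)  # yield control, standing in for generation
        await queue.put({"id": sample_id, "param_version": state["version"]})
    await queue.put(None)  # sentinel: no more samples


async def trainer(queue, samples_per_step, sync_every, state):
    """Train once enough samples arrive; bump the version every few steps."""
    batch, steps = [], 0
    while (item := await queue.get()) is not None:
        batch.append(item)
        if len(batch) == samples_per_step:
            steps += 1  # stand-in for one training step
            batch = []
            if steps % sync_every == 0:
                state["version"] += 1  # stand-in for parameter sync
    return steps


async def _run(num_samples, require_batches, ppo_mini_batch_size,
               trigger_parameter_sync_step):
    state = {"version": 0}
    queue = asyncio.Queue()
    samples_per_step = require_batches * ppo_mini_batch_size
    _, steps = await asyncio.gather(
        rollouter(queue, num_samples, state),
        trainer(queue, samples_per_step, trigger_parameter_sync_step, state),
    )
    return steps, state["version"]


def run_toy_pipeline(num_samples=32, require_batches=1,
                     ppo_mini_batch_size=4, trigger_parameter_sync_step=2):
    """Return (training steps completed, parameter versions published)."""
    return asyncio.run(_run(num_samples, require_batches,
                            ppo_mini_batch_size, trigger_parameter_sync_step))


print(run_toy_pipeline())  # (8, 4)
```

With 32 samples and 4 samples per step the toy trainer runs 8 steps, and syncing every 2 steps publishes 4 versions. Samples tagged with an older `param_version` than the current one are exactly what the staleness controls are meant to bound.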

![AReaL asynchronous architecture](https://pic.shaojiemike.top/shaojiemike/2026/07/8b2f1e903f9ed26e6e37ee6d80103ebe.png){ width=90% }

From AReaL Figure 2, showing how Rollout, Replay Buffer, Trainer, and the parameter service are decoupled in an asynchronous RL system.

trainer v1

The new V1 entry point returns to the main training command:

python3 -m verl.trainer.main_ppo \
  trainer.use_v1=True \
  trainer.v1.trainer_mode=colocate_async \
  actor_rollout_ref.rollout.mode=async \
  trainer.v1.colocate_async.num_warmup_batches=1

The key code path is short:

trainer_cls = get_trainer_cls(config.trainer.v1.trainer_mode)
config.transfer_queue.enable = True
tq.init(config.transfer_queue)
self.trainer = trainer_cls(config=config)
self.trainer.init()
self.init_agent_loop_manager()
self.trainer.fit(self.agent_loop_manager)

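Two points are worth noting here:

trainer.v1.trainer_mode determines the concrete trainer class; the registered classes include PPOTrainerSync, PPOTrainerColocateAsync, and PPOTrainerSeparateAsync.
V1 enables TransferQueue by default; AgentLoop writes generation results into the queue, and the Trainer side samples from a replay buffer for training.

Three modes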
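
The mode-to-class dispatch in the first point can be pictured as a small registry. The sketch below is an assumption about the general pattern only: the three class names and `get_trainer_cls` appear in the text above, but the registry dict, the `register_trainer` decorator, and the empty class bodies are hypothetical and are not taken from the VeRL source.

```python
# Hypothetical registry; VeRL's real mechanism may differ.
_TRAINER_REGISTRY = {}


def register_trainer(mode):
    """Class decorator that records a trainer class under a mode name."""
    def decorator(cls):
        _TRAINER_REGISTRY[mode] = cls
        return cls
    return decorator


@register_trainer("sync")
class PPOTrainerSync:
    """Placeholder standing in for the real sync trainer."""


@register_trainer("colocate_async")
class PPOTrainerColocateAsync:
    """Placeholder standing in for the real colocate_async trainer."""


@register_trainer("separate_async")
class PPOTrainerSeparateAsync:
    """Placeholder standing in for the real separate_async trainer."""


def get_trainer_cls(trainer_mode):
    """Resolve a trainer_mode string to its registered class."""
    try:
        return _TRAINER_REGISTRY[trainer_mode]
    except KeyError:
        known = ", ".join(sorted(_TRAINER_REGISTRY))
        raise ValueError(f"unknown trainer_mode {trainer_mode!r}; expected one of: {known}")


print(get_trainer_cls("separate_async").__name__)  # PPOTrainerSeparateAsync
```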

flowchart LR
  subgraph S["sync"]
    S1["Shared resource pool"] --> S2["Sampling done"]
    S2 --> S3["Training update"]
    S3 --> S4["update_weights"]
  end

  subgraph C["colocate_async"]
    C1["Shared resource pool"] --> C2["warmup batch"]
    C2 --> C3["Async generation enqueued"]
    C3 --> C4["Trainer samples and trains"]
    C4 --> C5["abort / sleep / resume"]
  end

  subgraph A["separate_async"]
    A1["Trainer resource pool"] --> A3["Train"]
    A2["Standalone rollout resource pool"] --> A4["Async generation"]
    A4 --> A5["TransferQueue"]
    A5 --> A3
    A3 --> A6["Sync weights every parameter_sync_step"]
    A6 --> A2
  end

sync

sync is the V1 default mode:

python3 -m verl.trainer.main_ppo \
  trainer.use_v1=True \
  trainer.v1.trainer_mode=sync

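The code comments are direct: Trainer and rollout are colocated and partial rollout is disabled; update_weights runs at the end of each step, and sleep_replicas runs after sampling ends. trainer_sync.py is suitable as a correctness baseline or for tasks that are extremely sensitive to staleness.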

colocate_async

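colocate_async still shares one set of cards between training and inference, but generation requests go through a fully async client, and a warmup batch is queued before training starts: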

python3 -m verl.trainer.main_ppo \
  trainer.use_v1=True \
  trainer.v1.trainer_mode=colocate_async \
  trainer.v1.colocate_async.num_warmup_batches=1 \
  actor_rollout_ref.rollout.mode=async

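The code path is: get_llm_client() returns FullyAsyncLLMServerClient; on_train_begin() first calls _add_batch_to_generate(); on_step_end() updates weights and calls resume_generation_replicas(); on_sample_end() aborts unfinished requests and puts the replicas to sleep. The E2E smoke test provides a minimal runnable configuration.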

separate_async

separate_async is the V1 mode that actually splits Trainer and rollout into separate resource pools:

python3 -m verl.trainer.main_ppo \
  trainer.use_v1=True \
  trainer.v1.trainer_mode=separate_async \
  trainer.v1.separate_async.num_warmup_batches=4 \
  trainer.v1.separate_async.parameter_sync_step=4 \
  actor_rollout_ref.rollout.mode=async \
  actor_rollout_ref.rollout.nnodes="${ROLLOUT_NNODES}" \
  actor_rollout_ref.rollout.n_gpus_per_node="${ROLLOUT_GPUS_PER_NODE}" \
  actor_rollout_ref.rollout.checkpoint_engine.backend=nccl

This path has several hard constraints:

data.train_batch_size == actor_rollout_ref.actor.ppo_mini_batch_size
actor_rollout_ref.rollout.nnodes > 0
actor_rollout_ref.rollout.n_gpus_per_node > 0
the rollout checkpoint engine must not be naive
if a reward model is enabled, separate_async does not support a colocated RM and requires reward.reward_model.enable_resource_pool=True

trainer_separate_async.py also switches the hybrid engine between rollout and trainer states; the standalone rollout serves generation through an independent LLMServerManager and synchronizes weights every parameter_sync_step.
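
The hard constraints listed above can be turned into a pre-flight check before launching a job. The following is a minimal sketch over a plain nested dict, not VeRL's own validation: `get_path` and `check_separate_async` are hypothetical helpers, and the `reward.reward_model.enable` switch is an assumed key name used only to express "a reward model is enabled".

```python
def get_path(cfg, dotted_key, default=None):
    """Look up a dotted key such as 'a.b.c' in nested dicts."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def check_separate_async(cfg):
    """Return a list of violated constraints (empty means the check passed)."""
    errors = []
    rollout = "actor_rollout_ref.rollout"

    if get_path(cfg, "data.train_batch_size") != get_path(
        cfg, "actor_rollout_ref.actor.ppo_mini_batch_size"
    ):
        errors.append("data.train_batch_size must equal ppo_mini_batch_size")
    if get_path(cfg, rollout + ".nnodes", 0) <= 0:
        errors.append("rollout.nnodes must be > 0")
    if get_path(cfg, rollout + ".n_gpus_per_node", 0) <= 0:
        errors.append("rollout.n_gpus_per_node must be > 0")
    if get_path(cfg, rollout + ".checkpoint_engine.backend") == "naive":
        errors.append("rollout checkpoint engine must not be naive")

    # Assumption: `reward.reward_model.enable` is a hypothetical on/off key.
    rm_enabled = get_path(cfg, "reward.reward_model.enable", False)
    rm_pool = get_path(cfg, "reward.reward_model.enable_resource_pool", False)
    if rm_enabled and not rm_pool:
        errors.append("reward model needs enable_resource_pool=True")
    return errors


config = {
    "data": {"train_batch_size": 64},
    "actor_rollout_ref": {
        "actor": {"ppo_mini_batch_size": 64},
        "rollout": {
            "nnodes": 1,
            "n_gpus_per_node": 8,
            "checkpoint_engine": {"backend": "nccl"},
        },
    },
}
print(check_separate_async(config))  # []
```

Returning a list instead of raising on the first failure makes it easier to fix every violation in one pass.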

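Defining staleness

Does it mean generating this extra percentage for the next training step?

Not exactly. Intuitively it is an upper bound on how many old-parameter samples the buffer tolerates, not a fixed requirement that the Rollouter generate some extra percentage every time. Whether the Rollouter actually generates extra depends on rollout speed, queue size, in-flight requests, parameter sync timing, and partial rollout.

In the legacy fully_async_policy documentation, async_training.staleness_threshold is the maximum allowed fraction of stale samples: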

rollout_num =
  (1 + staleness_threshold)
  * (trigger_parameter_sync_step * require_batches * ppo_mini_batch_size)
  - num_staleness_sample

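The control logic in the source matches this meaning: max_required_samples = required_samples * (staleness_threshold + 1) * trigger_parameter_sync_step, and after each parameter change active_tasks + queue_size is counted back into staleness_samples. Both the rollouter code and the reset logic show that it constrains how old the queued and in-flight samples may be.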
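
The rollout_num formula above is easy to evaluate by hand, and a small helper makes the effect of each knob visible. This is a sketch of the documented formula only: the function name `rollout_budget` is hypothetical, the example values are made up, and any rounding that the real implementation applies is deliberately left out.

```python
def rollout_budget(staleness_threshold, trigger_parameter_sync_step,
                   require_batches, ppo_mini_batch_size,
                   num_staleness_sample=0):
    """Samples the Rollouter may still produce before the next parameter sync.

    Mirrors: (1 + staleness_threshold)
             * (trigger_parameter_sync_step * require_batches * ppo_mini_batch_size)
             - num_staleness_sample
    """
    samples_per_sync = (
        trigger_parameter_sync_step * require_batches * ppo_mini_batch_size
    )
    return (1 + staleness_threshold) * samples_per_sync - num_staleness_sample


# Hypothetical settings: 4 steps per sync, 1 batch of 32 samples per step.
print(rollout_budget(0.0, 4, 1, 32))                           # 128.0
print(rollout_budget(0.5, 4, 1, 32))                           # 192.0
print(rollout_budget(0.5, 4, 1, 32, num_staleness_sample=40))  # 152.0
```

The last call shows why the threshold is a cap rather than a fixed surplus: samples already queued or in flight from the old version are subtracted from the budget.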

![Staleness gate](https://pic.shaojiemike.top/shaojiemike/2026/07/895e8588fb8869a23884de899d52b259.png){ width=90% }

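Hand-drawn illustration: staleness works more like a weigh-gate for samples, where the character Xiao Hei sends usable samples to the Trainer by parameter version and discards samples that are too old.

For V1 staleness you also need to look at the replay buffer metrics. ReplayBuffer sorts by each prompt's global_steps and prefers older prompts that have already completed, to reduce backlog; if max_off_policy_threshold is exceeded, the prompt is handled by the configured drop or wait policy. The quantity the V1 replay buffer checks is:

(global_steps - prompt_global_steps + 1) / parameter_sync_step

V1 training metrics also record how many versions a trajectory spans during generation and how many versions it lags behind the current policy: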

trajectory_spans = (max_global_steps - min_global_steps + 1) / parameter_sync_step
trajectory_staleness = ((global_steps - 1) - max_global_steps) / parameter_sync_step
trajectory_staleness_worst = ((global_steps - 1) - min_global_steps) / parameter_sync_step
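
The three metric formulas and the replay buffer quantity can be evaluated on a concrete trajectory. This is a sketch that only restates the expressions from the text; the function names and the example step numbers are hypothetical.

```python
def prompt_off_policy(global_steps, prompt_global_steps, parameter_sync_step):
    """Quantity the replay buffer compares against max_off_policy_threshold."""
    return (global_steps - prompt_global_steps + 1) / parameter_sync_step


def trajectory_metrics(global_steps, min_global_steps, max_global_steps,
                       parameter_sync_step):
    """Version span and version lag of one trajectory, in units of syncs."""
    return {
        "trajectory_spans":
            (max_global_steps - min_global_steps + 1) / parameter_sync_step,
        "trajectory_staleness":
            ((global_steps - 1) - max_global_steps) / parameter_sync_step,
        "trajectory_staleness_worst":
            ((global_steps - 1) - min_global_steps) / parameter_sync_step,
    }


# Hypothetical trajectory: generated across steps 5..8, consumed at step 10,
# with weights synchronized every 4 steps.
metrics = trajectory_metrics(global_steps=10, min_global_steps=5,
                             max_global_steps=8, parameter_sync_step=4)
print(metrics["trajectory_spans"])            # 1.0
print(metrics["trajectory_staleness"])        # 0.25
print(metrics["trajectory_staleness_worst"])  # 1.0
print(prompt_off_policy(10, 5, 4))            # 1.5
```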

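So do not treat every staleness as the same scalar:

**staleness_threshold in legacy fully async**: the upper bound on the fraction of stale samples.
**max_off_policy_threshold in the V1 sampler**: the threshold on how many model versions a trajectory may span.
**trajectory staleness in training logs**: the actual version lag of a sample relative to the current training step.

Resource split

The first principle of resource allocation is to make rollout time and train time as close as possible within the async pipeline. The VeRL fully async documentation gives the same advice: the ideal allocation brings rollout time close to train time to reduce pipeline bubbles; if rollouter/idle_ratio is high and trainer/idle_ratio is low, add Trainer resources and reduce Rollouter resources, and vice versa (see the tuning suggestions in that documentation).

If the sync profile shows rollout:train ≈ 3:1, and we assume for now that training and inference scale with card count at roughly the same exponent, the first round can use:

T_rollout / R_rollout ≈ T_train / R_train
R_rollout : R_train ≈ T_rollout : T_train ≈ 3 : 1

The corresponding recommendations are: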

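| Total cards | First-round split | Alternative splits | When to use |
| --- | --- | --- | --- |
| 64P | `16 train : 48 rollout` | `24:40`, `32:32` | If training memory or DP/TP sharding needs more cards, starting from `24:40` is safer. |
| 128P | `32 train : 96 rollout` | `40:88`, `48:80`, `64:64` | If the rollout long tail is very heavy, try `32:96` first; if update_actor is already near the bottleneck, try `48:80` or `64:64`. |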
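
The first-round split in the table follows directly from the rollout:train time ratio. Below is a minimal sketch of that arithmetic, assuming linear scaling; `first_round_split` is a hypothetical helper, and the `granularity` of 8 cards (for example one node) is an assumption rather than something the source specifies.

```python
def first_round_split(total_cards, rollout_time, train_time, granularity=8):
    """Split cards so that card share matches each stage's share of sync time.

    `granularity` is a hypothetical allocation unit; both pools get at least
    one unit so neither side ends up empty.
    """
    ideal_train = total_cards * train_time / (rollout_time + train_time)
    train_cards = round(ideal_train / granularity) * granularity
    train_cards = max(granularity, min(train_cards, total_cards - granularity))
    return train_cards, total_cards - train_cards


# rollout:train = 3:1 from the sync profile
print(first_round_split(64, rollout_time=3, train_time=1))   # (16, 48)
print(first_round_split(128, rollout_time=3, train_time=1))  # (32, 96)
```

The result is only a starting point; memory limits or DP/TP sharding may force the train pool to the next larger alternative in the table.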

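Do not look only at the sync-phase ratio

The gen:update_actor ratio from the sync profile is a starting point, not the final answer. Once async is on, old log prob, reward, parameter sync, queue waits, sleep/resume, and the NPU communication backend all shift the bottleneck. The resource split must be adjusted against timing_s/step, timing_s/gen, trainer/idle_ratio, rollouter/idle_ratio, and off-policy metrics.

How to read idle ratio

Under fully async, inference and training can both report a nonzero idle ratio, but that does not mean the system is balanced. If, within the same window, rollouter/idle_ratio ≈ 10% and trainer/idle_ratio ≈ 50%, rollout is most likely still the main bottleneck: the inference side is busy 90% of the time, and the Trainer often waits for the queue to fill a training batch. If both sides are idle during the same stretch of wall-clock time, first investigate parameter sync, queue thresholds, checkpoint/communication, Ray scheduling, sleep/resume, or mismatched metric windows. In practice, combine queue_size, active_tasks, stale/drop samples, and step timing to decide. In the rollout-bound case, usually do not add more Trainer cards; first try adding rollout resources, lowering the training-side wait threshold, or handing some Trainer cards to the Rollouter.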
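
The reading rules above can be written down as a small triage function. This is a rough sketch, not a VeRL utility: `diagnose_idle` is a hypothetical name, and the 0.3 cut-off for "high idle" is an arbitrary assumption that should be replaced by whatever fits your own runs.

```python
def diagnose_idle(rollouter_idle, trainer_idle, high=0.3):
    """Classify one measurement window from the two idle ratios (0..1).

    `high` is a hypothetical threshold for calling a side "mostly waiting".
    """
    if rollouter_idle >= high and trainer_idle >= high:
        # Both waiting: look at sync, queue thresholds, scheduling, metric windows.
        return "both_idle"
    if trainer_idle >= high:
        # Trainer waits for samples: add rollout cards or lower the wait threshold.
        return "rollout_bound"
    if rollouter_idle >= high:
        # Rollouter waits for the trainer: add Trainer cards.
        return "train_bound"
    return "balanced"


print(diagnose_idle(rollouter_idle=0.10, trainer_idle=0.50))  # rollout_bound
print(diagnose_idle(rollouter_idle=0.45, trainer_idle=0.05))  # train_bound
print(diagnose_idle(rollouter_idle=0.40, trainer_idle=0.40))  # both_idle
```

A label like this is only a first filter; queue_size, active_tasks, and stale/drop counts are still needed to confirm the cause.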

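Limits of async gains

With async on and a reasonable split, global per-step time should drop, but per-card throughput does not necessarily rise. For example, if a 32P sync baseline decomposes into R+R+R+T = 4 units, then with 128P split as 32T:96R, the ideal staleness=0 case runs the three R units in parallel and then trains, taking 2/4 = 0.5; the ideal staleness=1 case overlaps rollout and train in steady state, taking 1/4 = 0.25.

Per-card throughput upper bound

If there is no long-tail wait from varying output lengths, and no extra waste from padding, dynamic batching, or sync barriers, then async by itself cannot push per-card throughput above a fully loaded sync baseline. The reason is simple: async changes the scheduling order, not the total amount of computation each logical update must complete. What staleness can change is the degree of overlap between rollout and train, that is, how much of the pipeline bubble can be hidden.

The formulas here use the abstraction from the estimation script: let s ∈ [0,1] be the fraction of the bubble that is hidden. This s is not the raw staleness_threshold from the VeRL config but an overlap coefficient obtained after mapping through system behavior: s=0 means no steady-state overlap, and s=1 means rollout and train fully overlap in steady state.

Convert both the rollout and the train of one update into card-seconds:

W_r: the effective work rollout must complete.
W_t: the effective work train must complete.
C_r: the number of cards assigned to rollout.
C_t: the number of cards assigned to train.
C = C_r + C_t: the total number of cards.

First define the service times of the two stages under the current resource split: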

a = W_r / C_r
b = W_t / C_t
M = max(a, b)
m = min(a, b)

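When s=0 the two stages have no steady-state overlap and the time is approximately a+b; when s=1 the shorter stage is fully covered by the longer one and the time is approximately M. A simple intermediate model is therefore: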

A(s) = M + (1 - s) * m
     = a + b - s * m
     where 0 <= s <= 1

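So as staleness increases, A(s) decreases monotonically from a+b to M but never goes below M. That is:

A(s) >= M = max(W_r / C_r, W_t / C_t)

Next, the weighted-average inequality gives the resource lower bound: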

max(W_r / C_r, W_t / C_t)
  >= (C_r * (W_r / C_r) + C_t * (W_t / C_t)) / (C_r + C_t)
  =  (W_r + W_t) / (C_r + C_t)
  =  (W_r + W_t) / C

Why this is a weighted-average inequality

Let:

a = W_r / C_r
b = W_t / C_t
M = max(a, b)

Since a <= M and b <= M, no weighted average with positive weights can exceed M. Here the card counts are used as weights:

(C_r * a + C_t * b) / (C_r + C_t)
  <= (C_r * M + C_t * M) / (C_r + C_t)
  = M

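That is:

max(a, b) >= (C_r * a + C_t * b) / (C_r + C_t)

Substituting a = W_r / C_r and b = W_t / C_t gives the inequality in the main text. The weights must be C_r and C_t, because then the right-hand side simplifies to:

(W_r + W_t) / (C_r + C_t)

Its meaning: W_r + W_t is the total work in card-seconds and C_r + C_t is the total card count. Even with perfectly ideal scheduling, total time cannot be lower than total work / total resources. So this step is not an arbitrary average of rollout time and train time; it converts both stages to a single resource lower bound.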
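
The inequality can also be sanity-checked numerically over random workloads and splits. This is a quick sketch rather than a proof: `lower_bound_holds` is a hypothetical helper, the sampled ranges are arbitrary, and the small tolerance only absorbs floating-point rounding.

```python
import random


def lower_bound_holds(w_r, w_t, c_r, c_t, tol=1e-12):
    """Check max(W_r/C_r, W_t/C_t) >= (W_r + W_t) / (C_r + C_t)."""
    stage_max = max(w_r / c_r, w_t / c_t)
    resource_bound = (w_r + w_t) / (c_r + c_t)
    return stage_max >= resource_bound - tol


rng = random.Random(0)  # fixed seed so the check is reproducible
trials = [
    (rng.uniform(1, 100), rng.uniform(1, 100), rng.randint(1, 127))
    for _ in range(1000)
]
# Every trial splits 128 cards into c_r rollout cards and 128 - c_r train cards.
all_hold = all(
    lower_bound_holds(w_r, w_t, c_r, 128 - c_r) for w_r, w_t, c_r in trials
)
print(all_hold)                            # True
print(lower_bound_holds(96, 32, 96, 32))   # True (equality case: 1 >= 1)
```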

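This is the resource lower bound: with C total cards, completing W_r + W_t of effective work takes at least (W_r + W_t) / C seconds. If the 32P sync baseline is already fully loaded, its time is:

A_32 = (W_r + W_t) / 32

Then per-card throughput relative to the 32P sync baseline is: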

per_card_throughput(s)
  = (A_32 / A(s)) * (32 / C)
  <= 1

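Equality requires strict conditions: s=1, W_r / C_r = W_t / C_t, and no parameter sync, queue wait, communication, or scheduling overhead. The earlier case of R:T = 3:1 with 128P at 32T:96R satisfies exactly this ratio. In that case: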

W_r = 3 * 32 = 96
W_t = 1 * 32 = 32
C_r = 96
C_t = 32

a = 96 / 96 = 1
b = 32 / 32 = 1
A(s) = 1 + (1 - s) * 1 = 2 - s

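So the estimated time at staleness=0 is A(0)=2, which is 0.50 relative to the 32P baseline A_32=4, with per-card throughput (4 / 2) * (32 / 128) = 0.50x; the estimated time at staleness=1 is A(1)=1, which is 0.25 relative to the 32P baseline, with per-card throughput (4 / 1) * (32 / 128) = 1.00x.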
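
The worked numbers above can be reproduced with a direct transcription of the A(s) model. This is a sketch of the formulas in this section only; the function names are hypothetical and the model ignores parameter sync, queueing, and communication overhead, as the text states.

```python
def step_time(w_r, w_t, c_r, c_t, s):
    """A(s) = M + (1 - s) * m, with a = W_r / C_r and b = W_t / C_t."""
    a, b = w_r / c_r, w_t / c_t
    return max(a, b) + (1 - s) * min(a, b)


def per_card_throughput(w_r, w_t, c_r, c_t, s, baseline_cards=32):
    """Throughput per card relative to a fully loaded sync baseline."""
    baseline_time = (w_r + w_t) / baseline_cards
    total_cards = c_r + c_t
    return (baseline_time / step_time(w_r, w_t, c_r, c_t, s)) * (
        baseline_cards / total_cards
    )


# R:T = 3:1 on a 32P baseline, re-split as 96 rollout : 32 train on 128P.
work = dict(w_r=96, w_t=32, c_r=96, c_t=32)
for s in (0.0, 0.5, 1.0):
    t = step_time(s=s, **work)
    print(f"s={s}: A(s)={t:.3f}, throughput={per_card_throughput(s=s, **work):.3f}x")
# s=0.0 gives A(s)=2.000 and 0.500x; s=1.0 gives A(s)=1.000 and 1.000x
```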

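When async can exceed the sync baseline

If measured async per-card throughput exceeds the sync baseline, it usually does not violate the lower bound above; rather, the sync baseline is not an ideal baseline that is fully loaded with effective computation. For example, a response-length long tail makes the sync batch wait for the slowest sample, padding makes short samples do useless computation, dynamic batching raises inference engine efficiency, or the sync implementation has extra barriers, weight sync, checkpointing, or Ray scheduling idle time. In those cases async improves effective work / total card-seconds; it does not complete the same effective work in fewer card-seconds than the physical lower bound.

Async pays off when:

The rollout long tail is pronounced and the Trainer often waits for generation.
Rollout and train can truly overlap, rather than being re-serialized by parameter sync, memory switching, or queue blocking.
The fraction of stale samples is under control, without heavy drop/wait.
The train/rollout resource split is close to bottleneck balance.
The reward model, old log prob, ref log prob, and similar stages have not become new bottlenecks.

Conversely, the following make async gains fall short of this ideal model:

Too few rollout cards, so the Trainer still waits on the queue for long periods.
Too many rollout cards, so the Rollouter idles heavily and per-card throughput degrades.
parameter_sync_step is too small, so parameter sync is too frequent.
staleness is too large, so old samples introduce off-policy bias and accuracy or response length becomes unstable.
Under separate_async, the checkpoint engine or the NCCL/NIXL/Mooncake backend is unstable.

The 128-card Qwen2.5-Math-7B experiment in the VeRL documentation is a useful reference, but it should not be extrapolated mechanically: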

| staleness_threshold | step | gen | update_actor | Total time for 400 steps | acc/mean@1 |
| --- | --- | --- | --- | --- | --- |
| 0 | 231.34 | 128.47 | 98.77 | 1d 1h 53m | max 0.2844 / last 0.2604 |
| 0.1 | 171.30 | 58.17 | 109.12 | 19h 59m | max 0.3542 / last 0.2979 |
| 0.3 | 146.11 | 38.88 | 103.22 | 17h 20m | max 0.3469 / last 0.2865 |
| 0.5 | 150.63 | 33.14 | 113.16 | 17h 22m | max 0.3521 / last 0.3094 |

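This data shows three things:

From 0 to 0.1/0.3/0.5, step time drops markedly.
The times at 0.3 and 0.5 are very close; the gain does not grow linearly.
Accuracy does not change monotonically with staleness, and the documentation also notes that response length and training stability can confound the conclusion.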
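
The first two observations are easier to see as speedups over the staleness_threshold=0 row. The snippet below only re-derives ratios from the step column of the table; the variable names are hypothetical.

```python
# Step time per staleness_threshold, copied from the table above.
step_time_by_threshold = {0.0: 231.34, 0.1: 171.30, 0.3: 146.11, 0.5: 150.63}

baseline = step_time_by_threshold[0.0]
speedup = {
    threshold: baseline / step_time
    for threshold, step_time in step_time_by_threshold.items()
}

for threshold, ratio in speedup.items():
    print(f"staleness_threshold={threshold}: {ratio:.2f}x")
```

The ratios come out at roughly 1.35x, 1.58x, and 1.54x for thresholds 0.1, 0.3, and 0.5, so most of the gain arrives by 0.3 and going to 0.5 does not help further in this run.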

![AReaL staleness ablation](https://pic.shaojiemike.top/shaojiemike/2026/07/f63b2b358eeebfa6c059acacb2ddf5bf.png){ width=90% }

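From AReaL Figure 5: naive PPO is more sensitive to staleness; combined with the decoupled objective, moderate staleness can raise throughput while maintaining quality.

Estimation script

I put a lightweight calculation script in a separate repository: Kirrito-k423/verl-perf.

Default assumptions:

A 32P sync baseline.
rollout:train time is about 3:1.
Linear scaling is assumed by default; if measured scaling is worse, lower it with --train-scale-exp and --rollout-scale-exp.
The default heatmap shows normalized_step_time = step_time / 32P_sync_step_time.
staleness improves step time by reducing the pipeline bubble; this is a first-pass sizing model and does not replace real benchmarking.

Normalization convention

The step_time_heatmap here is not in absolute seconds but a ratio relative to the 32P sync baseline. With baseline_rollout_time=3 and baseline_train_time=1, one 32P sync step is R+R+R+T = 4; at 128P 32:96 with staleness=0 it is max(3/3, 1) + min(3/3, 1) = 2, so it shows 0.50; at staleness=1 the steady state fully overlaps, so it shows 0.25.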
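
The normalization can be reproduced with a small grid search over splits. This is an independent sketch written from the description above, not the code in the verl-perf repository: the function names and the 8-card search step are assumptions, and only linear scaling is modeled (no scale exponents).

```python
def normalized_step_time(train_cards, total_cards, s, baseline_cards=32,
                         baseline_rollout_time=3.0, baseline_train_time=1.0):
    """Step time relative to the sync baseline, assuming linear scaling."""
    rollout_cards = total_cards - train_cards
    # Linear scaling: stage time is inversely proportional to its card count.
    a = baseline_rollout_time * baseline_cards / rollout_cards
    b = baseline_train_time * baseline_cards / train_cards
    step = max(a, b) + (1 - s) * min(a, b)
    return step / (baseline_rollout_time + baseline_train_time)


def best_split(total_cards, s, step=8):
    """Return (train_cards, rollout_cards, normalized time) with the lowest time."""
    candidates = range(step, total_cards, step)
    best_train = min(
        candidates, key=lambda c: normalized_step_time(c, total_cards, s)
    )
    return (best_train, total_cards - best_train,
            normalized_step_time(best_train, total_cards, s))


print(normalized_step_time(32, 128, s=0.0))  # 0.5
print(best_split(128, s=1.0))                # (32, 96, 0.25)
print(best_split(64, s=1.0))                 # (16, 48, 0.5)
```

One thing this toy model suggests: with no overlap (s=0) the lowest modeled step time on 128P is at 48:80 rather than 32:96, so the best split can depend on how much overlap is actually achieved.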

Run:

python3 verl_perf_heatmap.py \
  --total-cards 128 \
  --baseline-cards 32 \
  --baseline-rollout-time 3 \
  --baseline-train-time 1 \
  --out-dir outputs/128p

Outputs:

verl_perf_grid.csv
step_time_heatmap.png
per_card_throughput_heatmap.png
async_allocation_timeline.png

![128P async allocation timeline](https://pic.shaojiemike.top/shaojiemike/2026/07/f616f071b207cf8d270039ceaf6b294d.png){ width=95% }

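A typical `32 train : 96 rollout` allocation: the 32P baseline is `R+R+R+T`; at 128P with `staleness=0`, rollout runs in parallel first and then train, for a normalized time of `0.50`; with `staleness=1` and full steady-state overlap it is `0.25`.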

![128P step time heatmap](https://pic.shaojiemike.top/shaojiemike/2026/07/0491b2d03aed1732daefb5b56ea4d0cc.png){ width=95% }

128P default estimate: the heatmap shows step time as a ratio of the 32P sync baseline. `32:96` is `0.50` at `staleness=0` and `0.25` at `staleness=1`.

![128P per card throughput heatmap](https://pic.shaojiemike.top/shaojiemike/2026/07/d9d2ee80cf93972b8cf9ec6bc07d05a0.png){ width=95% }

128P default estimate: per-card throughput relative to the 32P sync baseline reaches `1.00x` at `32:96, staleness=1`; if the split deviates from the bottleneck ratio or staleness is insufficient, per-card efficiency drops.

Benchmark order

For 128P, run 32:96 at staleness=0/0.5/1.0 first, then 48:80 at staleness=0.5/1.0 and 64:64 at staleness=0.5; for 64P, run 16:48, 24:40, and 32:32 first. For each run, record at least the step breakdown, idle ratio, stale/drop sample counts, response length, and validation metrics.

References

VeRL main_ppo.py V1 entry point: https://github.com/verl-project/verl/blob/355bf40c734b9f743292dfc83f971b2ca1874cff/verl/trainer/main_ppo.py#L129-L179
VeRL V1 trainer config: https://github.com/verl-project/verl/blob/355bf40c734b9f743292dfc83f971b2ca1874cff/verl/trainer/config/ppo_trainer.yaml#L200-L248
VeRL sync / colocate_async / separate_async trainer: https://github.com/verl-project/verl/tree/355bf40c734b9f743292dfc83f971b2ca1874cff/verl/trainer/ppo/v1
VeRL fully async documentation: https://github.com/verl-project/verl/blob/355bf40c734b9f743292dfc83f971b2ca1874cff/docs/advance/fully_async.md
VeRL one-step-off documentation: https://github.com/verl-project/verl/blob/355bf40c734b9f743292dfc83f971b2ca1874cff/docs/advance/one_step_off.md
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning: https://arxiv.org/abs/2505.24298
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation: https://arxiv.org/abs/2504.15930
AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training: https://arxiv.org/abs/2507.01663
Magistral: https://arxiv.org/abs/2506.10910

Permalink: http://icarus.shaojiemike.top/2026/07/01/Work/Artificial Intelligence/Training/PostTrain/VeRLAsync/

Author: Shaojie Tan
